As governments worldwide fast-track initiatives for public health and safety, how effectively those initiatives are implemented matters just as much. Among those initiatives, rapid testing is the highest-priority step; limited testing capacity and minimal knowledge of the virus strain have been major contributors to the virus's wide spread.
The author presents findings based on the results of laboratory tests commonly ordered for a suspected COVID-19 case during an Emergency Room (ER) visit, to show how an analytical approach caters for the given situation.
The case study gives a high-level view of how a comprehensive, data-driven decision-making approach helps in tackling this pandemic. A multi-layer perceptron model is used (a 5-layer deep neural network with 6, 12, 18, 24 and 30 neurons in each layer respectively), and the author explains how this turned out to be the best-fit model for the given dataset.
As we know, a neural network model is a black box when it comes to interpretability. To make the analytical decision-making approach fair, accountable and transparent, the LIME interpretability approach is implemented, which enables the model to assist decision making for healthcare professionals and doctors during COVID-19 clinical test analysis.
Overall, these analytical approaches help COVID-19 frontline workers obtain accurate test results while requiring only a minimal set of clinical tests to interpret the result.
#Import Library
from io import StringIO
import pandas as pd
import numpy as np
from numpy import where
#for visualization
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import pyplot
from matplotlib import cm
from matplotlib.pyplot import figure
import seaborn as sns
sns.set()
#for pydot and graphviz
from IPython.display import Image
#Metrics
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Machine Learning libraries
from sklearn.neural_network import MLPClassifier #For Multi-Layer Perceptron
from sklearn.model_selection import GridSearchCV #for GridSearch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, LabelEncoder # very important for feature transformation
#Logistic Regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
#Explainability
import lime
from lime import lime_tabular
#Ignore the warnings
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', 120) # Display upto 120 rows
- As a first step, the data was explored and the necessary normalisation and filtering steps were carried out.
#Import Data
Data=pd.read_csv('Diagnosis_of_COVID-19_and_its_clinical_spectrum.csv')
# Shape of the dataset
print(" The data set has a total of "+str(Data.shape[0])+" entries and "+str(Data.shape[1])+" features")
- Note: Here a feature represents a clinical test; the same analogy is used throughout the process.
Data.head() #How the input data looks like
#Understanding the data distribution
#Proportion of data
df2 = Data.copy()
df2 = df2.rename(columns={"Patient ID":"Patient_iD"})
proportion=(df2['SARS-Cov-2 exam result'].value_counts()/df2['SARS-Cov-2 exam result'].count())*100 # Ratio of patients tested positive and negative in the dataset
print(proportion)
fig1, ax1 = plt.subplots()
ax1.pie(proportion, labels=['Negative cases', 'Positive cases'], autopct='%1.1f%%', startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
The above pie chart depicts that the majority of the data represents negative cases; only about 10% of the records are logged as positive.
print("Overall this dataset has "+str(round(100*df2.isna().to_numpy().sum()/(df2.shape[0]*df2.shape[1]),2)) + "% of missing values")
# Percentage of missing values for each feature - checking the data for missing values
df_null_pct = df2.isna().mean().round(4) * 100 # Mean fraction of NaN values in each feature, multiplied by 100 to express it as a percentage
df_null_pct.sort_values()
The above list shows that many features have a large share of missing values.
As the next step, the author deals with the missing data, selects features, and works toward a better balance of positive and negative cases; as it stands, the input data is heavily skewed towards negative cases.
Each feature holds a different type of value; the section below converts the values to appropriate data types while handling the missing values.
# Replacing categorical string names with values using masking
mask = {'positive': 1, 'negative': 0, 'detected': 1, 'not_detected': 0, 'not_done': np.NaN,
        'Não Realizado': np.NaN, 'absent': 0, 'present': 1, 'normal': 1,
        'light_yellow': 1, 'yellow': 2, 'citrus_yellow': 3, 'orange': 4,
        'clear': 1, 'lightly_cloudy': 2, 'cloudy': 3, 'altered_coloring': 4,
        '<1000': 1000, 'Ausentes': 0, 'Urato Amorfo --+': 1,
        'Oxalato de Cálcio +++': 1, 'Oxalato de Cálcio -++': 1, 'Urato Amorfo +++': 1}
df3 = df2.copy()
df3 = df2.replace(mask)
df3.head()
pcs_vars = {'respiratory': ['Influenza B', 'Respiratory Syncytial Virus', 'Influenza A',
'Metapneumovirus', 'Parainfluenza 1', 'Inf A H1N1 2009',
'Bordetella pertussis', 'Chlamydophila pneumoniae', 'Coronavirus229E',
'Parainfluenza 3', 'CoronavirusNL63','Parainfluenza 4',
'Rhinovirus/Enterovirus', 'CoronavirusOC43', 'Coronavirus HKU1'],
'regular_blood': ['Monocytes','Hemoglobin', 'Hematocrit',
'Red blood cell distribution width (RDW)', 'Red blood Cells',
'Platelets', 'Eosinophils', 'Basophils', 'Leukocytes',
'Mean corpuscular hemoglobin (MCH)', 'Mean corpuscular volume (MCV)',
'Lymphocytes'],
'influenza_rapid': ['Influenza B, rapid test', 'Influenza A, rapid test']}
X_df =df3[['Patient age quantile']+['SARS-Cov-2 exam result'] + pcs_vars['regular_blood']+ pcs_vars['influenza_rapid'] +pcs_vars['respiratory']] # Selected variables for training the model
#percentage of missing values in positive case
dataset_positive = X_df[X_df['SARS-Cov-2 exam result'] == 1]
total = dataset_positive.isnull().sum().sort_values(ascending=False)
percent = (dataset_positive.isnull().sum()/dataset_positive.isnull().count()).sort_values(ascending=False)
missing_data_positive = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data_positive.head()
#percentage of missing values in negative case
dataset_negative = X_df[X_df['SARS-Cov-2 exam result'] == 0]
total = dataset_negative.isnull().sum().sort_values(ascending=False)
percent = (dataset_negative.isnull().sum()/dataset_negative.isnull().count()).sort_values(ascending=False)
missing_data_negative = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data_negative
#drop features from positive cases with more than 86% missing value
columns_to_exclude = missing_data_positive.index[missing_data_positive['Percent']> 0.86].tolist() #more than 86% missing value
X_df = X_df.drop(columns=columns_to_exclude) # reassign rather than drop in place, to avoid chained-assignment warnings on a slice
print(columns_to_exclude)
X_df
# Redefine dataset positive and negative
dataset_negative = X_df[X_df['SARS-Cov-2 exam result'] == 0]
dataset_positive = X_df[X_df['SARS-Cov-2 exam result'] == 1]
dataset_negative = dataset_negative.dropna(axis=0, thresh=5) #keep only rows with at least 5 non-NA values
DN=dataset_negative.fillna(dataset_negative.mean())
DP=dataset_positive.fillna(dataset_positive.mean())
#concatenate positive and negative cases together
X = pd.concat([DN, DP])
nof_positive_cases = len(dataset_positive.index)
nof_negative_cases = len(dataset_negative.index)
fig1, ax1 = plt.subplots()
ax1.pie([nof_positive_cases, nof_negative_cases], labels=['Positive cases', 'Negative cases'], autopct='%1.1f%%', startangle=90)#, colors=['#c0ffd5', '#ffc0cb'])
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
X #Dataset after all the data-preprocessing
Each feature correlates differently with the rest of the feature set, so let's see how feature correlation is
distributed over the given dataset.
corrmat = abs(X.corr())
# Correlation with output variable
cor_target = corrmat["SARS-Cov-2 exam result"]
# Selecting highly correlated features
relevant_features = cor_target[cor_target>0.05].index.tolist()
# f, ax = plt.subplots(figsize=(16, 8))
# sns.heatmap(abs(X[relevant_features].corr().iloc[0:1, :]), yticklabels=[relevant_features[0]], xticklabels=relevant_features, vmin = 0.0, square=True, annot=True, vmax=1.0, cmap='RdPu')
X_with_relevant_features = X[relevant_features]
y_with_relevant_features = X_with_relevant_features['SARS-Cov-2 exam result']
X_with_relevant_features = X_with_relevant_features.drop(columns=['SARS-Cov-2 exam result'])
XX=X_with_relevant_features.copy()
#Setting new x and y values with the new dataframe.
# # One hot encoding is done for target variable
enc = OneHotEncoder()
Y = enc.fit_transform(y_with_relevant_features.to_numpy()[:, np.newaxis]).toarray()
rs = 41 #Random State
X_train, X_test, y_train, y_test = train_test_split(XX, Y, test_size = 0.30, random_state = rs) #Split the data for train and Test with a ratio of 70:30
# Shape of the train and test splits, without the target variable
print("Train split: "+str(X_train.shape[0])+" entries, test split: "+str(X_test.shape[0])+" entries, with "+str(X_train.shape[1])+" features each")
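Given the class imbalance noted earlier, one option (not used in this notebook) is to stratify the split so that both partitions keep the same positive/negative ratio. A minimal sketch on toy data, purely for illustration:

```python
# Hedged sketch (not part of the original notebook): the `stratify` argument
# of train_test_split preserves the class ratio in both partitions. The toy
# labels below are illustrative stand-ins for the imbalanced exam results.
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_toy = np.array([0] * 8 + [1] * 2)    # imbalanced labels: 8 negative, 2 positive
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=41, stratify=y_toy)
# Each partition receives exactly one positive case, preserving the 8:2 ratio
print(y_tr.sum(), y_te.sum())
```

Without `stratify`, a small test split drawn from such skewed data can end up with no positive cases at all, which distorts test metrics.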
Since there was a total of 30 features, the author used an MLP architecture with 5 hidden layers of 6, 12, 18, 24 and 30 neurons (nodes) respectively, trained for up to 400 epochs (iterations) with the default L2 regularisation strength (alpha = 0.0001).
#Default Neural Network - Multi-Layer Perceptron
modelNN = MLPClassifier(hidden_layer_sizes=(6,12,18,24,30),random_state=rs,
                        solver='adam',max_iter=400) #Running a default multi-layer perceptron with random state rs = 41
modelNN.fit(X_train, y_train)
print("Parameters used in building this model are:\n")
print(modelNN) # Information about the neural network architecture used
#Test and Train Accuracy
print("\nClassification accuracy on training and test datasets are:\n")
print("Train accuracy:", modelNN.score(X_train, y_train))
print("Test accuracy:", modelNN.score(X_test, y_test))
#Model Prediction
y_pred_modelNN = modelNN.predict(X_test)
print("\n Classification Report: \n")
# Print Classification Report
print(classification_report(y_test, y_pred_modelNN))
#Variation of Cost over the iteration
plt.ylabel('cost')
plt.xlabel('iterations')
plt.title("alpha = " + str(0.0001))
plt.plot(modelNN.loss_curve_)
plt.show()
The above graph shows how the cost (error) decreases as the number of iterations increases. We can see that, with the
above-mentioned neural network parameters and architecture, gradient descent converged to a local minimum, producing the
best-fit model.
Result of the best-fit model: 97.93% test accuracy and 96.34% train accuracy within a total of 400 iterations and alpha (L2 regularisation strength) of 0.0001.
Understanding the Classification Report:
Precision: of all positive predictions, the fraction that are actual positive observations.
Precision = True Positives / (True Positives + False Positives)
Recall: of all actual positive observations, the fraction that were predicted positive.
Recall = True Positives / (True Positives + False Negatives)
F1 Score: the harmonic mean of precision and recall. F1-score = (2 x recall x precision) / (recall + precision)
Note: these were explained for the positive class (1); the same applies to the negative class (0).
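The definitions above can be checked numerically. A toy example (the counts are illustrative, not taken from this dataset):

```python
# Toy check of the metric definitions above; tp/fp/fn counts are made up.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)   # TP / (TP + FP)
recall = tp / (tp + fn)      # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```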
- Since the data is heavily skewed towards negative patients, with far fewer positive COVID-19 patients,
grid search with k-fold CV could not make much difference.
- As for dimensionality reduction, the feature importances were very close together and the procedure selected 29 features out of 30,
which brought no improvement in model learning or interpretation.
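For reference, a grid search of the kind mentioned above (GridSearchCV is imported but not shown in this notebook) could look like the sketch below. The parameter grid and the synthetic imbalanced data are illustrative assumptions, not the author's exact search space:

```python
# Hedged sketch of a grid search over MLP hyperparameters; the grid values
# and the synthetic dataset are assumptions for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in with roughly the same 9:1 imbalance as the real data
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     weights=[0.9, 0.1], random_state=41)
param_grid = {
    'hidden_layer_sizes': [(6, 12), (6, 12, 18)],
    'alpha': [1e-4, 1e-3],  # L2 regularisation strength
}
search = GridSearchCV(
    MLPClassifier(solver='adam', max_iter=200, random_state=41),
    param_grid, cv=3, scoring='f1')  # f1 is more informative than accuracy here
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With so few positive cases, each CV fold sees only a handful of positives, which is consistent with the author's observation that the search made little difference.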
When we deploy this analytical model in a decision-making process like the one in this scenario, the neural network cannot itself explain its result. Its output is meant to support the healthcare workers and doctors on the frontline of battling this pandemic; a model that acts as a black box, where you feed values in one end and a result comes out the other, would be of little use. Any slight mistake in the decision they take puts a large part of the population at risk, not only the patient.
To resolve this issue, give healthcare professionals better insight into the test report, and address the ethical concerns of an approval board about deploying this methodology in the health sector, the author has equipped the best-fit model with an interpretability feature.
# Confusion matrix - gives insight into how well the model predicted the output and the ratio of false positives and false negatives
groundtruth = enc.inverse_transform( y_test ) #Actual Test result
predictions = enc.inverse_transform(modelNN.predict( X_test )) #Predicted Test Results
mat = confusion_matrix(groundtruth, predictions) #Creating Confusion Matrix
print(mat.T)
#Creating the heatmap of the confusion matrix
sns.heatmap(mat.T, square=True, cbar=True, xticklabels=["Negative", "Positive"], \
yticklabels=[" Negative", "Positive"], annot=True, cmap=cm.viridis)
#Graph parameters
plt.xlabel('true label')
plt.ylabel('predicted label');
Let's have a look at how well the model predicts in comparison with the actual results.
#Merging predicted and actual test results
pred = enc.inverse_transform(modelNN.predict( X_test ))
X_Converted=XX.to_numpy()
display_Limit = 0
for patient_indx in range(0, len(X)):
    patients_feat = X_Converted[patient_indx,:]
    patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0]
    # prediction
    pred = modelNN.predict(np.expand_dims(patients_feat, 0))
    pred = enc.inverse_transform( pred )[0][0]
    #For display purposes only the first 10 patients are shown; comment out the if statement to print all patient information
    if display_Limit < 10:
        print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n"
              %(patient_indx, df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative",
                "Positive" if patients_true_pred else "Negative"))
    display_Limit += 1
So let's consider the four types of results we observe from the model, which are crucial for doctors and healthcare professionals: false positive, false negative, true negative and true positive.
- Patient Number 557 - predicted Positive, True value Negative -False Positive
- Patient Number 1467 - predicted Negative, True value Positive - False Negative
- Patient Number 1368 - predicted Negative, True value Negative - True Negative
- Patient Number 1470 - predicted Positive, True value Positive - True Positive
#Selecting all the features from the main dataset
feature_names = XX.columns.to_list()
len(feature_names)
feature_names
# LIME has one explainer for all the models
# very important for feature transformation
X_train=X_train.to_numpy()
MAX_FEAT = 15 #Number of features to consider for explainability
explainer = lime_tabular.LimeTabularExplainer(X_train, feature_names= feature_names, class_names=["Negative", "Positive"], verbose=False, mode='classification')
# patient num 557 predicted Positive, true value Negative -False Positive
patient_indx = 557 # Patient Number
# Features of given patient - Test results of given patient number
patients_feat = X_Converted[patient_indx,:]
# Convert back from one hot encoding 1-D array, True(Actual) value of Test Result
patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0]
# prediction
pred = modelNN.predict(np.expand_dims(patients_feat, 0)) # For given patient Number predicted values
pred = enc.inverse_transform( pred )[0][0] # converting back from one hot encoding to single value, for predicted test result
#Printing patient number, predicted test result and actual test result
print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n"
%(patient_indx,df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative", "Positive" if patients_true_pred else "Negative"))
# explain instance
exp_type01 = explainer.explain_instance(patients_feat, modelNN.predict_proba, num_features= MAX_FEAT )
# Show the predictions
exp_type01.show_in_notebook(show_table=True )
How confident is the model that the given patient is COVID-19 positive or negative (out of 1)? Converting to a percentage, in this scenario the model is 73% sure that the patient is positive, with 26% doubt that the result may be negative.
- For the given scenario this tells the healthcare professional how confident one can be in the result; since this case is a false positive, the probability indicates a 26% chance of the patient being negative.
The visualisation in the middle of the screen shows how each feature contributes to the model's prediction; the value just above each horizontal bar indicates its confidence value, and the bars are ordered from the highest confidence value to the lowest (considering the top 10 features).
- For the given scenario, after seeing the probability, doctors may want to examine which features drive the result; based on their expertise and medical knowledge they can then decide what to do for the patient.
The rightmost visualization shows the actual values of these top 10 features from the dataset.
- For the given scenario, when doctors make a decision, these values give them a quick sneak peek at the test results.
- Note: studies have suggested that COVID-19 commonly affects red blood cells (RBCs) (Amdahl, 2020), and from there the effect on RBCs echoes into other elements. So, as a health professional, our first glance would be at the tests relating to RBCs (several tests are primarily based on RBCs) and how strongly the model's confidence values support the RBC-related tests.
To summarise this situation: looking at the model's probability and confidence values, since the feature-wise confidence of being positive is not significantly high either, doctors/health professionals may have to order an extra, specific test to confirm whether the result is negative, since one of the RBC tests supports COVID-19 being negative. The confidence values of the features and their actual values support healthcare professionals in taking the necessary steps, and also make the decision-making process faster.
A much more detailed split of the confidence values is given below, to show the result at higher resolution.
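The per-feature weights can also be summarised programmatically from `as_list()`, which returns (condition, weight) pairs. A hedged sketch of ranking them into a quick, clinician-readable table; the pairs below are illustrative stand-ins, not real output from the notebook's model:

```python
# Hedged sketch: LIME's as_list() yields (condition, weight) pairs; sorting
# by |weight| surfaces the most influential tests first. These pairs are
# hypothetical stand-ins for exp_type01.as_list() output.
explanation = [
    ("Leukocytes <= -0.50", 0.12),          # hypothetical condition / weight
    ("Platelets <= -0.30", -0.08),
    ("Patient age quantile > 10.00", 0.05),
]
ranked = sorted(explanation, key=lambda fw: abs(fw[1]), reverse=True)
for cond, w in ranked:
    side = "supports Positive" if w > 0 else "supports Negative"
    print(f"{cond:<30} {side} ({w:+.3f})")
```

A positive weight pushes the prediction towards the Positive class, a negative weight towards Negative, matching the left/right split in the LIME bar chart.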
#print(exp_type01.as_list()) #Print the full list of top features and their effects
explanation_plot = exp_type01.as_pyplot_figure() #Plot showing variation in feature importance
### patient num 1467 predicted Negative. True value Positive - False Negative
patient_indx = 1467 # Patient Number
# Features of given patient - Test results of given patient number
patients_feat = X_Converted[patient_indx,:]
# Convert back from one hot encoding 1-D array, True(Actual) value of Test Result
patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0]
# prediction
pred = modelNN.predict(np.expand_dims(patients_feat, 0)) # For given patient Number predicted values
pred = enc.inverse_transform( pred )[0][0] # converting back from one hot encoding to single value, for predicted test result
#Printing patient number, predicted test result and actual test result
print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n"
%(patient_indx,df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative", "Positive" if patients_true_pred else "Negative"))
# explain instance
exp_type02 = explainer.explain_instance(patients_feat, modelNN.predict_proba, num_features= MAX_FEAT )
# Show the predictions
exp_type02.show_in_notebook(show_table=True )
- A much more detailed split of the confidence values is given below, to show the result at higher resolution.
#print(exp_type02.as_list()) #Print the full list of top features and their effects
explanation_plot = exp_type02.as_pyplot_figure() #Plot showing variation in feature importance
### patient num 1368 predicted Negative. True value Negative - True Negative
patient_indx = 1368 # Patient Number
# Features of given patient - Test results of given patient number
patients_feat = X_Converted[patient_indx,:]
# Convert back from one hot encoding 1-D array, True(Actual) value of Test Result
patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0]
# prediction
pred = modelNN.predict(np.expand_dims(patients_feat, 0)) # For given patient Number predicted values
pred = enc.inverse_transform( pred )[0][0] # converting back from one hot encoding to single value, for predicted test result
#Printing patient number, predicted test result and actual test result
print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n"
%(patient_indx,df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative", "Positive" if patients_true_pred else "Negative"))
# explain instance
exp_type03 = explainer.explain_instance(patients_feat, modelNN.predict_proba, num_features= MAX_FEAT )
# Show the predictions
exp_type03.show_in_notebook(show_table=True )
To summarise this situation: looking at the model's probability and confidence values, doctors/health professionals can confirm whether the result is actually negative, since the confidence values of the features, including the RBC tests, support the outcome being negative. Effectively filtering such patients lowers the burden on healthcare professionals and also speeds up the decision-making process.
A much more detailed split of the confidence values is given below, to show the result at higher resolution.
# patient Number 1470 predicted Positive. true value Positive - True Positive
patient_indx = 1470# Patient Number
# Features of given patient - Test results of given patient number
patients_feat = X_Converted[patient_indx,:]
# Convert back from one hot encoding 1-D array, True(Actual) value of Test Result
patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0]
# prediction
pred = modelNN.predict(np.expand_dims(patients_feat, 0)) # For given patient Number predicted values
pred = enc.inverse_transform( pred )[0][0] # converting back from one hot encoding to single value, for predicted test result
#Printing patient number, predicted test result and actual test result
print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n"
%(patient_indx,df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative", "Positive" if patients_true_pred else "Negative"))
# explain instance
exp_type04 = explainer.explain_instance(patients_feat, modelNN.predict_proba, num_features= MAX_FEAT )
# Show the predictions
exp_type04.show_in_notebook(show_table=True )
- A much more detailed split of the confidence values is given below, to show the result at higher resolution.
#print(exp_type04.as_list()) #Print the full list of top features and their effects
explanation_plot = exp_type04.as_pyplot_figure() #Plot showing variation in feature importance
Amdahl, M. (2020, April 11). Covid-19: Debunking the Hemoglobin Story. Retrieved from https://medium.com/@amdahl/covid-19-debunking-the-hemoglobin-story-ce27773d1096
Anderson, R., Heesterbeek, H., Klinkenberg, D., & Hollingsworth, T. (2020, March 9). How will country-based mitigation measures influence the course of the COVID-19 epidemic? The Lancet, 395, 931-934. doi:10.1016/S0140-6736(20)30567-5
CDC. (n.d.). Coronavirus Disease 2019 (COVID-19) - Protect Yourself. Retrieved May 22, 2020, from https://www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/prevention.html?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fcoronavirus%2F2019-ncov%2Fprepare%2Fprevention.html
Drum, K. (2020, April 2). US Coronavirus Test Is Only 60-70% Accurate. Retrieved May 22, 2020, from https://www.motherjones.com/kevin-drum/2020/04/wsj-us-coronavirus-test-is-only-60-70-accurate/
Evans, M. (2020, April 4). Coronavirus home test kits: Everything you need to know. Retrieved May 22, 2020, from https://www.t3.com/au/news/coronavirus-home-test-kits-when-theyll-be-on-sale-where-to-buy-and-more
The Guardian. (2020, April 11). WHO looks into report of Covid patients testing positive after negative tests. Retrieved from https://www.theguardian.com/world/2020/apr/11/who-looks-into-report-of-covid-patients-testing-positive-after-negative-tests
Johns Hopkins University. (n.d.). Coronavirus Resource Center. Retrieved May 22, 2020, from https://coronavirus.jhu.edu/map.html
Schofield, C. (2020, May 14). A new antibody test for coronavirus with 100% accuracy has been approved - here’s how it works. Retrieved May 22, 2020, from https://www.morpethherald.co.uk/read-this/new-antibody-test-coronavirus-100-accuracy-has-been-approved-heres-how-it-works-2853193
Stevens, H. (2020, March 14). Why outbreaks like coronavirus spread exponentially and how to flatten the curve. Retrieved May 22, 2020, from https://www.washingtonpost.com/graphics/2020/world/corona-simulator/
WHO. (2020, March 9). WHO Director-General's opening remarks at the media briefing on COVID-19 - 9 March 2020. Retrieved from https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19---9-march-2020